feat: distributed vector search via index segment selection #24

Draft
jja725 wants to merge 4 commits into main from feat/distributed-vector-search
Collaborator

@jja725 jja725 commented Apr 24, 2026

Exposes Lance's segment-model APIs through the C ABI so a distributed query engine (Velox, Presto worker, etc.) can fan a single k-NN query out across workers, each scanning a slice of the logical index's physical segments. Tracks lance#6309.

Distributed query pattern

Coordinator                                  Worker(s)
─────────────────                            ───────────────
open dataset
list segments  ──────── slice ──────────►   open same dataset
                                             scanner.nearest(q, k)
                                             scanner.index_segments(my_slice)
                                             return partial top-k stream
heap-merge partial top-k  ◄─────────────────  (Velox top-k operator handles this)
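
The coordinator-side merge in the diagram above can be sketched as follows. This is a minimal illustration, not code from this PR: the `Hit` type and `merge_top_k` function are hypothetical names, and it assumes a smaller distance means a closer neighbour. In the Velox setup the top-k operator plays this role.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// One candidate returned by a worker (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct Hit {
    pub row_id: u64,
    pub distance: f32,
}

// Total order on (distance, row_id) so that BinaryHeap is a max-heap on distance.
impl Ord for Hit {
    fn cmp(&self, other: &Self) -> Ordering {
        self.distance
            .total_cmp(&other.distance)
            .then(self.row_id.cmp(&other.row_id))
    }
}
impl PartialOrd for Hit {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Hit {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Hit {}

/// Merge per-worker partial top-k lists into a global top-k, nearest first.
pub fn merge_top_k(partials: &[Vec<Hit>], k: usize) -> Vec<Hit> {
    let mut heap: BinaryHeap<Hit> = BinaryHeap::new();
    for hit in partials.iter().flatten() {
        heap.push(*hit);
        if heap.len() > k {
            heap.pop(); // evict the farthest candidate seen so far
        }
    }
    heap.into_sorted_vec() // ascending by distance
}
```

The heap never holds more than k + 1 entries, so the coordinator's memory use is independent of the number of workers.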

Summary

lance_dataset_index_segment_count(ds, name) — number of physical segments in a logical vector index. Returns 0 + LANCE_ERR_NOT_FOUND for an unknown name.

lance_dataset_index_segments(ds, name, out_uuids) — fills a caller-allocated buffer (count * 16 bytes) with each segment's 16-byte UUID (RFC 4122).

lance_scanner_set_index_segments(scanner, segment_uuids, len) — restricts the next lance_scanner_nearest() query to a subset of segments. len=0 (any pointer) clears the restriction.

C++ wrappers:

  • Dataset::index_segment_count(name) → uint64_t
  • Dataset::index_segments(name) → std::vector<std::array<uint8_t, 16>>
  • Scanner::index_segments(uuids) (typed vector overload + raw uint8_t* + len overload) — fluent
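
To make the buffer layout concrete, here is a rough sketch of how a coordinator might decode the flat `count * 16`-byte buffer and hand out slices. The helper names and the round-robin policy are assumptions made for illustration; the PR does not prescribe an assignment strategy.

```rust
/// A segment identifier: the raw 16 bytes of an RFC 4122 UUID.
pub type SegmentUuid = [u8; 16];

/// Split a flat buffer of back-to-back 16-byte UUIDs into individual ids.
/// Returns `None` if the length is not a multiple of 16.
pub fn decode_uuid_buffer(buf: &[u8]) -> Option<Vec<SegmentUuid>> {
    if buf.len() % 16 != 0 {
        return None;
    }
    Some(
        buf.chunks_exact(16)
            .map(|chunk| {
                let mut id = [0u8; 16];
                id.copy_from_slice(chunk);
                id
            })
            .collect(),
    )
}

/// Assign segments to `workers` slices round-robin (hypothetical policy).
pub fn assign_round_robin(segments: &[SegmentUuid], workers: usize) -> Vec<Vec<SegmentUuid>> {
    let mut slices: Vec<Vec<SegmentUuid>> = vec![Vec::new(); workers];
    if workers == 0 {
        return slices;
    }
    for (i, seg) in segments.iter().enumerate() {
        slices[i % workers].push(*seg);
    }
    slices
}
```

Each resulting slice is what a worker would pass to the scanner's segment setter before running its nearest query.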

Lance dep bump

To get Scanner::with_index_segments() (merged in lance #6376) we bump from crates.io lance = "3.0.1" to a git+rev pin at lance commit d630106d (release tag v5.0.0-beta.5). beta-5 keeps arrow on 57.0.0, so there is no transitive arrow churn. The DatasetIndexExt trait moved from lance_index to lance::index; one import path adjusted in src/index.rs.

When lance publishes 5.0.0 stable, the git+rev can be replaced with the version pin.

Test plan

  • cargo fmt clean
  • cargo clippy --all-targets -- -D warnings clean
  • cargo test: 75 passed (70 from main + 5 new)
  • cargo test --test compile_and_run_test -- --ignored — 2 passed (C + C++ smoke)

New tests:

  • test_index_segment_count_and_list — build IVF index, count = 1, list returns a non-zero UUID.
  • test_index_segment_count_unknown_index — unknown name → NotFound.
  • test_scanner_set_index_segments_with_listed_uuids — end-to-end k=5 nearest restricted to listed segment UUID, returns 5 results.
  • test_scanner_set_index_segments_unknown_uuid — bogus UUID is accepted at setter time, surfaces as an error at scan materialize time with a message containing "segment".
  • test_scanner_set_index_segments_null_safety — NULL scanner / NULL pointer with len>0 / NULL with len=0 (clears).
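
The pointer/length cases exercised by the null-safety test can be summarised in a small sketch. This is not the PR's implementation: `parse_segment_arg` and `SegmentArg` are hypothetical, and it assumes `len` counts UUIDs rather than bytes, which the description does not state explicitly.

```rust
pub type SegmentUuid = [u8; 16];

/// Outcome of interpreting the (pointer, len) pair passed to the setter.
#[derive(Debug, PartialEq)]
pub enum SegmentArg {
    /// len == 0: drop any existing restriction, whatever the pointer is.
    Clear,
    /// len > 0 with a valid pointer: restrict to these segments.
    Restrict(Vec<SegmentUuid>),
}

/// # Safety
/// If `len > 0` and `ptr` is non-null, `ptr` must point to `len * 16` readable bytes.
pub unsafe fn parse_segment_arg(
    ptr: *const u8,
    len: usize,
) -> Result<SegmentArg, &'static str> {
    if len == 0 {
        return Ok(SegmentArg::Clear);
    }
    if ptr.is_null() {
        return Err("segment_uuids is NULL but len > 0");
    }
    let byte_len = len.checked_mul(16).ok_or("len * 16 overflows")?;
    // SAFETY: the caller guarantees `ptr` is readable for `len * 16` bytes.
    let bytes = unsafe { std::slice::from_raw_parts(ptr, byte_len) };
    let ids = bytes
        .chunks_exact(16)
        .map(|chunk| {
            let mut id = [0u8; 16];
            id.copy_from_slice(chunk);
            id
        })
        .collect();
    Ok(SegmentArg::Restrict(ids))
}
```

Note that nothing here checks whether a UUID names a real segment, matching the behaviour described above where a bogus UUID only fails at scan materialize time.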

Follow-ups (not in this PR)

  • Per-segment metadata: today we only expose UUID. A future pass could add fragment_bitmap / dataset_version / num_indexed_rows so coordinators can balance work by segment size.
  • Distributed build: commit_existing_index_segments() and merge_existing_index_segments() exist upstream — they'd let workers each train one segment and the coordinator commit them atomically.
  • Once lance publishes 5.0.0 stable, replace the git+rev pin with a version pin.

🤖 Generated with Claude Code

jja725 added 4 commits April 24, 2026 16:03
Switches lance / lance-core / lance-index / lance-io / lance-linalg
from crates.io 3.0.1 to a git+rev pin at lance commit d630106d
(release tag v5.0.0-beta.5). Adds lance-datagen / lance-file /
lance-table dev deps from the same rev.

Beta-5 introduces the segment-model APIs (Scanner::with_index_segments,
commit_existing_index_segments) that subsequent commits expose through
the C ABI for distributed vector search.

The DatasetIndexExt trait moved from lance_index to lance::index;
src/index.rs adjusts the import. arrow stays on 57.0.0 (matches beta-5).
Adds uuid 1.x for the upcoming UUID-based segment API.

Adds lance_dataset_index_segment_count(name) and
lance_dataset_index_segments(name, out_uuids) to enumerate the
physical segments of a logical vector index. Each segment is
identified by its 16-byte UUID (RFC 4122 layout, written as raw
bytes into a caller-allocated buffer of len*16 bytes).

The header also declares lance_scanner_set_index_segments (impl
in the next commit). Both pieces together let a distributed query
engine like Velox shard a single k-NN query across workers.

Five tests covering the new segment APIs:
- index_segment_count + listing UUIDs from a freshly built IVF index
- segment_count returns NotFound for an unknown index name
- end-to-end nearest scoped to listed segment UUIDs returns k results
- unknown UUID surfaces an error at scan materialize time (not at
  setter time)
- NULL safety for the scanner setter (NULL scanner; NULL ptr with
  non-zero len; NULL ptr with len=0 clears successfully)

Dataset::index_segment_count(name) and Dataset::index_segments(name)
return std::vector<std::array<uint8_t, 16>>. Scanner::index_segments
takes either the typed vector or a raw byte pointer + length.

C++ smoke test verifies the wrappers compile and link (no real
distributed dataset to exercise against in the test fixture).
Comment thread include/lance.h
const LanceDataset* dataset,
const char* index_name,
uint8_t* out_uuids
);
Contributor

@LuciferYang LuciferYang Apr 27, 2026

The buffer is sized by the caller using a separate lance_dataset_index_segment_count() call. The implementation reloads the snapshot independently in each call (snap.load_indices() is invoked twice), and the inner loop writes count * 16 bytes without any capacity check.

The C++ wrapper makes this two-call pattern explicit:

uint64_t count = index_segment_count(index_name);  // call #1: snapshot load
std::vector<std::array<uint8_t, 16>> out(count);
... lance_dataset_index_segments(...)              // call #2: snapshot load

Between call #1 and call #2, a concurrent writer could commit a new segment for the same logical index — exactly the distributed-build use case mentioned in the follow-ups section of the PR description. The second snapshot would then return more segments than the first, and the inner loop at src/index.rs:255–260 would overrun the caller's buffer:

for (i, seg) in segments.iter().enumerate() {
    let bytes = seg.uuid.as_bytes();
    unsafe {
        std::ptr::copy_nonoverlapping(bytes.as_ptr(), out_uuids.add(i * 16), 16);
    }
}

There is no SAFETY: comment justifying why out_uuids is large enough.


Possible Fixes

Adopt the well-established "capacity in, count out" FFI pattern (commonly seen in raw C APIs that fill caller-provided buffers):

int32_t lance_dataset_index_segments(
    const LanceDataset* dataset,
    const char* index_name,
    uint8_t* out_uuids,
    size_t capacity,        /* bytes available in out_uuids */
    uint64_t* out_count     /* how many UUIDs were actually written */
);

Reuse LANCE_ERR_INVALID_ARGUMENT (or introduce a new sentinel — the codebase currently has 8: LANCE_ERR_INVALID_ARGUMENT, LANCE_ERR_IO, LANCE_ERR_NOT_FOUND, LANCE_ERR_DATASET_ALREADY_EXISTS, LANCE_ERR_INDEX, LANCE_ERR_INTERNAL, LANCE_ERR_NOT_SUPPORTED, LANCE_ERR_COMMIT_CONFLICT) when capacity < segments.len() * 16. This also lets callers do single-shot retrieval with a guess and re-allocate if needed, removing the two-snapshot anti-pattern entirely.
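
A minimal sketch of the bounds-checked copy behind such a signature, assuming the segment list has already been loaded from a single snapshot. The numeric error-code values are placeholders (the real ones live in include/lance.h), and reporting the required count even on failure is a design choice of this sketch so that callers can re-allocate and retry.

```rust
// Placeholder values for illustration only.
pub const LANCE_OK: i32 = 0;
pub const LANCE_ERR_INVALID_ARGUMENT: i32 = 1;

/// Copy segment UUIDs into a caller-provided buffer of `capacity` bytes.
///
/// # Safety
/// `out_uuids` must be valid for `capacity` bytes of writes (or null), and
/// `out_count` must be null or point to a writable `u64`.
pub unsafe fn write_segment_uuids(
    segments: &[[u8; 16]],
    out_uuids: *mut u8,
    capacity: usize,
    out_count: *mut u64,
) -> i32 {
    if out_count.is_null() {
        return LANCE_ERR_INVALID_ARGUMENT;
    }
    // Report the number of segments up front so a caller with a short
    // buffer learns how much space a retry needs.
    unsafe { *out_count = segments.len() as u64 };

    let needed = match segments.len().checked_mul(16) {
        Some(n) => n,
        None => return LANCE_ERR_INVALID_ARGUMENT,
    };
    if needed == 0 {
        return LANCE_OK;
    }
    if out_uuids.is_null() || capacity < needed {
        return LANCE_ERR_INVALID_ARGUMENT; // nothing is written
    }
    for (i, seg) in segments.iter().enumerate() {
        // SAFETY: i * 16 + 16 <= needed <= capacity, checked above.
        unsafe { std::ptr::copy_nonoverlapping(seg.as_ptr(), out_uuids.add(i * 16), 16) };
    }
    LANCE_OK
}
```

Because the count and the copy come from the same segment list, a concurrent commit between two calls can no longer cause an overrun; at worst the caller sees an error and retries with a larger buffer.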

Lighter-weight alternative: have a single Rust call return the count and a heap-allocated buffer with the segments. The codebase already exposes lance_free_string for CString-style strings; an analogous lance_free_uuid_buffer (or generic lance_free_bytes) would be a small, well-scoped addition. This eliminates caller-side sizing altogether at the cost of an extra allocation.
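
The heap-allocated variant could look roughly like this. The function names are hypothetical stand-ins for the suggested lance_free_uuid_buffer-style API, and the sketch leaves out the C ABI wrapper and the snapshot load.

```rust
/// Flatten the segment UUIDs into one heap buffer and hand ownership to the
/// caller. Returns the pointer and the buffer length in bytes.
pub fn segments_into_raw(segments: &[[u8; 16]]) -> (*mut u8, usize) {
    let flat: Box<[u8]> = segments.iter().flatten().copied().collect();
    let len = flat.len();
    (Box::into_raw(flat) as *mut u8, len)
}

/// Release a buffer previously returned by `segments_into_raw`.
///
/// # Safety
/// `ptr` and `len` must be exactly the pair returned by `segments_into_raw`,
/// and the buffer must not be freed more than once.
pub unsafe fn free_uuid_buffer(ptr: *mut u8, len: usize) {
    if ptr.is_null() {
        return;
    }
    // SAFETY: rebuilds the same Box<[u8]> that `segments_into_raw` leaked.
    unsafe { drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len))) };
}
```

The caller must pass the length back on free, which is the main ergonomic cost compared with the NUL-terminated strings handled by lance_free_string.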

Comment thread src/scanner.rs
scanner.prefilter(true);
}
if let Some(segments) = &self.index_segments {
scanner.with_index_segments(segments.clone())?;
Contributor

The with_index_segments(...) call is placed inside if let Some(n) = &self.nearest { ... }:

if let Some(n) = &self.nearest {
    scanner.nearest(&n.column, n.query.as_ref(), n.k as usize)?;
    ...
    if let Some(segments) = &self.index_segments {
        scanner.with_index_segments(segments.clone())?;
    }
}

If a caller invokes lance_scanner_set_index_segments(...) but never calls lance_scanner_nearest(...), the segment restriction is silently ignored — no error, no warning. For a distributed-query worker scanning the wrong segments, this is a correctness footgun.

Recommended fix. Either:

  1. Validate at materialize time — return an error if index_segments.is_some() && nearest.is_none() with a message such as "index_segments requires nearest() to be configured".
  2. Validate at setter time — in lance_scanner_set_index_segments, reject if s.nearest.is_none() and document the ordering requirement (consistent with the project's existing fail-fast guards such as the if k == 0 { ... } check inside scanner_nearest_inner).

Option 1 is more flexible (allows a builder to set segments before nearest); option 2 fails earlier and is closer to the rest of the file's style.
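
Option 1 might be sketched like this. `ScanConfig` and `NearestConfig` are simplified stand-ins for the scanner's internal state, not the actual types in src/scanner.rs.

```rust
/// Simplified stand-in for the nearest-neighbour settings.
pub struct NearestConfig {
    pub column: String,
    pub k: usize,
}

/// Simplified stand-in for the scanner's accumulated builder state.
#[derive(Default)]
pub struct ScanConfig {
    pub nearest: Option<NearestConfig>,
    pub index_segments: Option<Vec<[u8; 16]>>,
}

impl ScanConfig {
    /// Run at materialize time, before the underlying scanner is built.
    pub fn validate(&self) -> Result<(), String> {
        if self.index_segments.is_some() && self.nearest.is_none() {
            return Err("index_segments requires nearest() to be configured".to_string());
        }
        Ok(())
    }
}
```

Checking at materialize time keeps the builder order-independent: a caller may set segments first and nearest second without tripping the guard.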

Comment thread src/index.rs
let snap = ds.snapshot();
match block_on(snap.load_indices()) {
Ok(indices) => {
let count = indices.iter().filter(|i| i.name == name).count();
Contributor

lance_dataset_index_count excludes system indexes (!is_system_index). The new lance_dataset_index_segment_count does not. If a system index ever shares a name with a user-visible index, the count silently includes it and lance_dataset_index_segments emits its UUIDs, which a worker may then attempt to query.
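
A sketch of the filter with the system-index exclusion applied. `IndexEntry` and its `is_system` flag are hypothetical; in the real code the check would be whatever predicate lance_dataset_index_count already uses.

```rust
/// Simplified stand-in for one physical index segment's metadata.
pub struct IndexEntry {
    pub name: String,
    pub uuid: [u8; 16],
    pub is_system: bool,
}

/// Select the user-visible segments that belong to the logical index `name`.
pub fn user_segments<'a>(indices: &'a [IndexEntry], name: &str) -> Vec<&'a IndexEntry> {
    indices
        .iter()
        .filter(|i| !i.is_system && i.name == name)
        .collect()
}
```

Sharing one helper between the count and list functions would also keep the two filters from drifting apart.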

Comment thread src/index.rs
})?;
let snap = ds.snapshot();
let indices = block_on(snap.load_indices())?;
let segments: Vec<_> = indices.iter().filter(|i| i.name == name).collect();
Contributor

Ditto: this filter also lacks the !is_system_index check.

@jja725 jja725 marked this pull request as draft April 27, 2026 16:55
Collaborator Author

jja725 commented Apr 27, 2026

@LuciferYang thanks for the review; I'll address the comments when I'm free. This is still an in-progress PR that needs the stable 5.0.0 release so we can adopt the segment model for distributed index search.
